of the fetal heart structure. Moreover, manually identifying the standard cardiac planes, primarily based on a four-chamber view, requires a well-trained and experienced clinician. In this paper, a CNN method using the U-Net architecture is proposed to automate fetal cardiac standard plane segmentation from ultrasound images. A total of 519 fetal cardiac images were obtained from three videos. All data are divided into training and testing data. The testing data consist of 106 slices for the four-chamber segmentation tasks, i.e., atrial septal defect (ASD), ventricular septal defect (VSD), and normal. A post-processing method is needed to enhance the segmentation result. In this paper, the combination of U-Net and Otsu thresholding gives the best performance, with 99.48% pixel accuracy, 96.73% mean accuracy, 94.92% mean intersection over union, and 0.21% segmentation error. In the future, the implementation of deep learning in the study of CHDs holds significant potential for identifying novel CHDs in heterogeneous fetal hearts.

Keywords: Segmentation, Ultrasound images, U-Net

1. INTRODUCTION

In developing countries such as India and Pakistan, about 1% of newborn babies are affected by congenital heart defects (CHDs). The number of newborn babies with CHDs is increasing, as reported in [1], where in 2011 the rate of CHDs was 9.1 per 1000 births. CHDs are heart diseases that can be detected as early as the first trimester of intra-uterine life [2]. Such defects are characterized by abnormalities in the heart structure, ranging from mild to severe. CHDs still dominate the problem of heart disease in infants and children. With an incidence of around 1% of infants in Indonesia, about 45,000 babies are born every year with CHDs ranging from mild to severe, including complex conditions [3]. A newborn with undiagnosed heart disease will be discharged home, where it is only a matter of time before the condition worsens or the infant dies. Diagnosing CHDs in the early stage of pregnancy allows for prompt, lifesaving treatment. Fetal diagnosis depends on observations by experienced clinicians using ultrasound imaging [4]. Unfortunately, because there are very few experts in the field of CHDs, it is common for an infant to be born with an undiagnosed heart problem [5]. Undetected CHDs are a serious problem: when an infant has a serious heart problem, the outcome often depends on an accurate diagnosis at the time of birth.

Newborns with severe heart disease that is not diagnosed before birth could die in the first month or become more severely ill while still in the maternity ward. Nonetheless, treating congenital heart disease within seven days after birth markedly improves the prognosis [6]. For this reason, numerous efforts have been made to develop technology that makes fast and accurate diagnosis possible. Precise fetal heart segmentation is essential for localizing structural heart abnormalities before birth [7]. The difficulty increases due to the small size of the fetal heart structures and depressions, particularly the thin chamber boundaries in the atrial septum, the membranous section of the ventricular septum, and the valvular leaflets [8].

A machine learning (ML) algorithm can learn from medical image data using statistical techniques [9], [10]. Using medical data allows ML to improve its performance on a specific task progressively. Furthermore, an ML algorithm allows a diagnostic system to detect disease faster and more accurately than a human being [11]. Unfortunately, this process requires a large amount of information on normal and abnormal subjects to recognize a particular disease [12]. The problem is that heart defects in infants are infrequent, so there is a lack of available information to train the ML algorithm. The same issue applies to congenital heart disease: the conditions are rare and no complete data sets exist, so predictions can only be made using relatively small and incomplete data sets [13]. Because of this limitation, diagnosis with ML-based methods has been insufficiently accurate and is not recommended for clinical use [14], [15]. Adding more ultrasound images to the system can help the ML algorithm learn better and improve its screening accuracy [16]. The study of CHDs has been primarily conducted with handcrafted features, referred to as ‘shallow learning’ [17], [18]. Maraci et al. [18] applied dynamic texture modelling with handcrafted, rotation-invariant features to detect the fetal heartbeat. Bridge, Ioannou, and Noble [17] proposed a framework based on sequential Bayesian filtering (SBF) to predict the visibility, position, and orientation of the fetal heart in consecutive frames. However, a shallow architecture cannot learn the essential features directly from the data because it requires feature engineering [15], [19].

One class of ML algorithms, deep learning (DL), can work with less data [20]-[22]. Using an augmentation strategy, the image set can be expanded and the augmented images used as learning data as well [23]. DL algorithms are vastly improving the speed and accuracy of medical diagnoses [12], [24]. In general, fetal heart diagnosis experts look for parts of the heart, such as valves and blood vessels, that are in incorrect positions by comparing normal and abnormal fetal heart images. With DL, this process is similar to the ‘object detection’ technique, which allows DL to locate and classify multiple objects appearing in images [25]-[28]. Using the ML approach, it is possible to develop an automatic diagnostic system that detects certain diseases faster than humans [12], [29]. Since research on automated fetal heart segmentation is limited according to the literature reviewed above, this study proposes a DL algorithm to detect abnormalities in fetal hearts based on ultrasound (US) images. With accurate detection, appropriate treatment can be given as soon as possible. DL, with its automatic feature learning ability, is an efficient method for fetal pattern recognition. Baumgartner et al. [26] employed a fully convolutional network to detect 12 standard planes and localize the respective fetal anatomy. Gao, Maraci, and Noble [27] presented a transfer learning-based design to study the transferability of features learned from natural images to ultrasound image object recognition. Chen et al. [28] proposed a hybrid model composed of ConvNets and recurrent neural networks (RNN) to explore spatiotemporal learning from contextual temporal information. However, fetal cardiac data sets are difficult to obtain, and large publicly available data sets are lacking. Therefore, 3D ultrasonographic videos from the internet have become the data resource in [30]-[32]. Processing US video to provide a reliable data set is a challenging task because the data are quite unstructured, with varying dimensions. Hence, in this paper, a preliminary study of deep learning with a convolutional neural network-based U-Net architecture is proposed to automate semantic segmentation of the four-chamber view of fetal cardiac images from unstructured data of varying dimensions.


2. SEGMENTATION PROCESS 

There are three stages in the segmentation process: (1) data collection and preparation, (2) automated segmentation using CNNs, and (3) post-processing and calculation of segmentation performance. Details of these stages are presented in the following subsections.


2.1. Data collection and preparation 


The data were obtained in MP4 format from three maternity cases in the gestational age range of 18-23 weeks. The raw data cover normal [7], ASD [33], and VSD [34] cases. The MP4 videos are converted into image frames (slices), which yields 519 images in total with different dimensions. The dimensions of the ASD, VSD, and normal data were 1280x720, 640x480, and 640x480, respectively. Before the ground-truth data were created, we performed a data pre-processing stage that included cropping and resizing. The overall data preparation steps are shown in Figure 1.


Figure 1. Data preparation stages: input video, frame conversion, image cropping, image resizing, pre-processed image


All images must be cropped to eliminate the noise from the frame. After that, the images are resized to reduce their dimensions to 400x300. Once the pre-processed data have been obtained, the next step is to create the ground truth for all images. The ground truth is used as annotated data in the training phase and to measure the segmentation results. Figure 2 illustrates the pre-processing steps and the ground truth process.
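
The cropping and resizing steps described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the crop box, the nearest-neighbour interpolation, the reading of 400x300 as width x height, and the function names `crop` and `resize_nearest` are all assumptions. Pure Python lists are used to keep the sketch dependency-free; an image library would normally be used in practice.

```python
def crop(image, top, left, height, width):
    """Return the height x width sub-image whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


def resize_nearest(image, new_height, new_width):
    """Resize a 2D list-of-lists image with nearest-neighbour sampling."""
    old_height, old_width = len(image), len(image[0])
    return [
        [
            image[r * old_height // new_height][c * old_width // new_width]
            for c in range(new_width)
        ]
        for r in range(new_height)
    ]


# Hypothetical 480-row x 640-column grayscale frame filled with a gradient.
frame = [[(r + c) % 256 for c in range(640)] for r in range(480)]

# Hypothetical crop box that removes the border around the ultrasound fan.
roi = crop(frame, top=40, left=80, height=400, width=520)

# Resize to 400x300, assumed here to mean 400 columns by 300 rows.
prepared = resize_nearest(roi, new_height=300, new_width=400)
print(len(prepared), len(prepared[0]))  # 300 400
```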





Figure 2. (a) Pre-processed data, (b) ground truth result


2.2. Automated semantic segmentation 

The purpose of semantic segmentation is to find the specific characteristics in a fetal cardiac image and separate the image into objects with those characteristics. In this study, a convolutional neural network (CNN) with U-Net architecture is used for fetal cardiac segmentation [35]. The U-Net architecture has been proven to perform semantic segmentation on medical data sets, and it has been used successfully in brain tumour segmentation [36] and cell segmentation [37]. It is an end-to-end fully convolutional network that contains convolution layers without any fully connected (dense) layer. Therefore, such an architecture can accept images of different sizes.

In general, the CNN-based U-Net architecture has two main layer types: the convolution layer and the pooling layer. The convolution layer slides a matrix (kernel or filter) over the input and transforms each value based on the filter's values. The convolution process is defined using (1):

$$G[m, n] = (f * h)[m, n] = \sum_{j} \sum_{k} h[j, k]\, f[m - j, n - k] \quad (1)$$

where f is the input image, h is the filter, and [m, n] indexes the rows and columns of the output G. The output size of the convolution operation is calculated using (2):

$$n_{out} = \left\lfloor \frac{n_{in} + 2p - k}{s} \right\rfloor + 1 \quad (2)$$

where n_in is the number of input features, n_out the number of output features, k the convolution kernel size, p the convolution padding size, and s the convolution stride size. The U-Net model uses max pooling in the pooling layer. This pooling function reduces the size of the feature map and, with it, the number of values the network must process. The max-pooling process selects the maximum pixel value from each window of the feature map to produce the pooled feature map. The critical point of the pooling process is that the image resolution is reduced, because high-resolution images are converted to low-resolution images. The U-Net architecture is divided into two parts, the encoder (contraction path) and the decoder (symmetric expansion path), as presented in Figure 3.
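
As a small worked example of (1), (2), and max pooling, the sketch below computes them on a 4x4 toy image. The function names and the toy image and kernel are illustrative assumptions. The convolution is restricted to positions where the kernel fits entirely inside the image, and it flips the kernel as written in (1); most deep learning frameworks compute the unflipped variant (cross-correlation) instead.

```python
def conv_output_size(n_in, k, p=0, s=1):
    """Equation (2): floor((n_in + 2p - k) / s) + 1."""
    return (n_in + 2 * p - k) // s + 1


def conv2d_valid(f, h):
    """Equation (1), evaluated only where the kernel h fits inside the image f."""
    kh, kw = len(h), len(h[0])
    out_h = conv_output_size(len(f), kh)
    out_w = conv_output_size(len(f[0]), kw)
    return [
        [
            sum(
                h[j][k] * f[m + kh - 1 - j][n + kw - 1 - k]
                for j in range(kh)
                for k in range(kw)
            )
            for n in range(out_w)
        ]
        for m in range(out_h)
    ]


def max_pool(f, size=2, stride=2):
    """Keep the maximum value of each size x size window."""
    out_h = conv_output_size(len(f), size, s=stride)
    out_w = conv_output_size(len(f[0]), size, s=stride)
    return [
        [
            max(
                f[m * stride + j][n * stride + k]
                for j in range(size)
                for k in range(size)
            )
            for n in range(out_w)
        ]
        for m in range(out_h)
    ]


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
kernel = [[1, 0], [0, -1]]  # toy kernel chosen for illustration

print(conv2d_valid(image, kernel))        # [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
print(max_pool(image))                    # [[6, 8], [14, 16]]
print(conv_output_size(256, 3, p=1))      # 256: 3x3 kernel with padding 1 keeps the size
print(conv_output_size(256, 2, s=2))      # 128: 2x2 pooling with stride 2 halves the size
```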

The function of the contraction path, or encoder, is to capture the context contained in the image. In this first part, an image with dimensions 256x256x1 is down-sampled to produce the context. The contraction path consists of several convolution and max-pooling layers, with a convolution kernel size of 3x3. A non-linear ReLU operation follows every convolution operation. In addition, there are 2x2 pooling operations with a stride of 2. Finally, this process produces 32x32x256 feature maps. The second part is the symmetric expanding path, or decoder, which localizes objects using convolution transformations. The symmetric expanding path applies up-sampling operations to the output of the contraction path. In these operations, the image size increases gradually and the depth gradually decreases, from 32x32x256 to 256x256x1. The symmetric expanding path restores the information generated by the contraction path by gradually performing up-sampling stages. A skip connection is applied at each layer of the symmetric expanding path to produce better segmentation results. This is done by concatenating the output of the convolution layers in the symmetric expanding path with the feature map from the contraction path at the same level.
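
The shape changes described above can be traced with a short sketch. It assumes 3x3 convolutions with "same" padding, 2x2 max pooling with stride 2, and three encoder stages with 64, 128, and 256 filters; that filter tuple is chosen only because it reproduces the 256x256x1 to 32x32x256 path stated in the text, and it is smaller than the 64 to 1024 configuration reported elsewhere in the paper. The function name `unet_shape_trace` is hypothetical, and no bottleneck block is modelled.

```python
def unet_shape_trace(size=256, channels=1, filters=(64, 128, 256)):
    """Trace (height, width, depth) through a U-Net-like encoder and decoder."""
    trace = [("input", (size, size, channels))]
    skips = []

    # Contraction path: a convolution block sets the depth, pooling halves the size.
    for f in filters:
        channels = f
        trace.append((f"enc conv {f}", (size, size, channels)))
        skips.append((size, channels))
        size //= 2
        trace.append(("enc pool", (size, size, channels)))

    # Expansion path: up-sample, concatenate the skip feature map, then convolve.
    for f in reversed(filters):
        size *= 2
        skip_size, skip_channels = skips.pop()
        assert skip_size == size  # skip connections join maps at the same level
        trace.append(("dec upsample + concat", (size, size, channels + skip_channels)))
        channels = f
        trace.append((f"dec conv {f}", (size, size, channels)))

    # Final 1x1 convolution maps the features to a single-channel mask.
    trace.append(("output 1x1 conv", (size, size, 1)))
    return trace


for name, shape in unet_shape_trace():
    print(f"{name:24s} {shape}")
```

The concatenation step is where the depth temporarily grows (for example 256 + 256 channels at the 64x64 level) before the following convolution reduces it again.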









Figure 3. U-Net architecture parts (layer types shown: InputLayer, Conv2D, MaxPooling2D, Dropout, UpSampling2D, Concatenate; filter counts from 64 to 1024)


2.3. Post-processing and performance measurement 

In this study, we applied several post-processing methods to enhance the segmentation result. In the post-processing phase, each image pixel value becomes 0 for the background and 1 for the foreground. This process improves the accuracy of the segmentation results. The post-processing methods used are global thresholding (fixed thresholding; threshold=127), Otsu thresholding [38], and local thresholding (Gaussian thresholding [39]). These methods are compared to find the best one for four-chamber segmentation. After the post-processing stage, the testing phase is performed. The result is validated using pixel accuracy (PA), mean accuracy (MA), and mean intersection over union (MIoU) [40]. The PA metric is the ratio between the number of correctly classified pixels and the total number of pixels. The metrics are expressed in terms of a confusion matrix, as in a classification task. The pixel accuracy is given in (3):


$$PA = \frac{\sum_{x=1}^{N_{cls}} N_{xx}}{\sum_{x=1}^{N_{cls}} \sum_{y=1}^{N_{cls}} N_{xy}} \quad (3)$$

where N_cls is the number of classes and N_xy is the number of pixels in class x that were predicted as class y. The entries of the confusion matrix represent false positives (N_xy), false negatives (N_yx), true positives (N_xx), and true negatives (N_yy). Mean accuracy (MA) calculates the accuracy ratio for each class and averages it over all classes (N_cls). MA is described in (4):

$$MA = \frac{1}{N_{cls}} \sum_{x=1}^{N_{cls}} \frac{N_{xx}}{\sum_{y=1}^{N_{cls}} N_{xy}} \quad (4)$$

The mean intersection over union (MIoU) is also known as the Jaccard index. This metric calculates the percentage of overlap between the labelled mask and the predicted output. Intersection over union is computed per class, and the values of all classes are averaged. The IoU metric is highly effective and very straightforward. MIoU is presented in (5):

$$MIoU = \frac{1}{N_{cls}} \sum_{x=1}^{N_{cls}} \frac{N_{xx}}{\sum_{y=1}^{N_{cls}} N_{xy} + \sum_{y=1}^{N_{cls}} N_{yx} - N_{xx}} \quad (5)$$
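
The three metrics can be computed directly from a confusion matrix whose entry `n[x][y]` counts pixels of class x predicted as class y. The sketch below is a minimal illustration on a toy 2x4 mask; the function names and the toy data are assumptions, and a class with no pixels would need extra handling to avoid division by zero.

```python
def confusion_matrix(truth, pred, n_cls=2):
    """n[x][y] = number of pixels of true class x predicted as class y."""
    n = [[0] * n_cls for _ in range(n_cls)]
    for truth_row, pred_row in zip(truth, pred):
        for t, p in zip(truth_row, pred_row):
            n[t][p] += 1
    return n


def pixel_accuracy(n):
    """Equation (3): correctly classified pixels over all pixels."""
    return sum(n[x][x] for x in range(len(n))) / sum(map(sum, n))


def mean_accuracy(n):
    """Equation (4): per-class accuracy averaged over the classes."""
    return sum(n[x][x] / sum(n[x]) for x in range(len(n))) / len(n)


def mean_iou(n):
    """Equation (5): per-class intersection over union, averaged over the classes."""
    n_cls = len(n)
    total = 0.0
    for x in range(n_cls):
        row_sum = sum(n[x])                          # pixels that truly belong to x
        col_sum = sum(n[y][x] for y in range(n_cls)) # pixels predicted as x
        total += n[x][x] / (row_sum + col_sum - n[x][x])
    return total / n_cls


# Toy ground-truth and predicted masks (0 = background, 1 = foreground).
truth = [[0, 0, 1, 1], [0, 0, 1, 1]]
pred = [[0, 0, 0, 1], [0, 1, 1, 1]]

n = confusion_matrix(truth, pred)
print(n)                  # [[3, 1], [1, 3]]
print(pixel_accuracy(n))  # 0.75
print(mean_accuracy(n))   # 0.75
print(mean_iou(n))        # 0.6
```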


3. RESULTS AND ANALYSIS 

In this study, we split the data into training and testing sets, with a proportion of about 80% and 20%, respectively. The training set of 413 images is used to train the U-Net architecture, while the test set of 106 images is used to measure the segmentation performance. We also tested several post-processing methods (fix threshold, Otsu, and Gaussian) to improve the segmentation result. The original U-Net architecture, with a sigmoid activation function in the last layer and a mean squared error loss function, is used as the baseline model. Table 1 presents the segmentation performance of the baseline model.


Table 1. Performance result of baseline model 


Metrics | Without post-processing | Fix threshold | Otsu | Gaussian
Pixel Accuracy [%] | 80.266 | 94.919 | 93.992 | 12.495
Mean IoU [%] | 43.660 | 73.458 | 74.007 | 6.645
Mean Accuracy [%] | 43.743 | 81.391 | 87.692 | 43.385


Table 1 shows that the fix and Otsu thresholding methods increased the segmentation performance. Gaussian thresholding, in contrast, failed because of the way it determines the threshold value. Fix and Otsu thresholding use one global value as the threshold, while Gaussian thresholding determines the threshold for each pixel from a small region around it. Gaussian thresholding therefore allows different thresholds for different regions, which generally gives better results for images with varying illumination. Figure 4 shows the segmentation results of the baseline model.
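
The difference between global and local thresholding can be illustrated with a minimal sketch on an 8-bit toy image stored as nested lists. The local method here uses an unweighted window mean with an offset `c` as a simplified stand-in for Gaussian-weighted thresholding; `radius`, `c`, and the toy image are illustrative assumptions, not values from the study. On a flat foreground region the local threshold equals the pixel value itself, so interior pixels are not marked as foreground. This is one plausible reading of the poor Gaussian result in Table 1, not something the source confirms.

```python
def fixed_threshold(img, t=127):
    """Global thresholding: 1 where the pixel exceeds t, else 0."""
    return [[1 if v > t else 0 for v in row] for row in img]


def otsu_threshold(img):
    """Return the global threshold that maximizes the between-class variance."""
    hist = [0] * 256
    for row in img:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def local_mean_threshold(img, radius=1, c=0):
    """Local thresholding: compare each pixel with the mean of its neighbourhood."""
    rows, cols = len(img), len(img[0])
    out = []
    for r in range(rows):
        out_row = []
        for col in range(cols):
            window = [
                img[i][j]
                for i in range(max(0, r - radius), min(rows, r + radius + 1))
                for j in range(max(0, col - radius), min(cols, col + radius + 1))
            ]
            local_t = sum(window) / len(window) - c
            out_row.append(1 if img[r][col] > local_t else 0)
        out.append(out_row)
    return out


# Toy network output: a bright 4x4 foreground block on a dark background.
image = [[10] * 6] + [[10, 200, 200, 200, 200, 10] for _ in range(4)] + [[10] * 6]

fixed_mask = fixed_threshold(image)
otsu_mask = fixed_threshold(image, otsu_threshold(image))
local_mask = local_mean_threshold(image)

print(otsu_threshold(image))                 # 10
print(fixed_mask == otsu_mask)               # True: both recover the full block
print(local_mask[2][2], local_mask[1][1])    # 0 1: interior lost, corner kept
```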








Figure 4. Segmentation results for normal, ASD, and VSD: (a) no post-processing, (b) fix thresholding, (c) Otsu thresholding, (d) Gaussian thresholding


To find the best parameters, several U-Net hyperparameters were tuned. We compared the segmentation results using binary cross-entropy as the loss function and changed the number of convolutional filters in each layer. The number of convolutional filters was scaled both down and up. Typically, the U-Net architecture has sets of 64, 128, 256, 512, and 1024 convolutional filters in the encoder and decoder paths. The comparison of the hyperparameter tuning results and the baseline U-Net model is shown in Table 2.


Table 2. U-Net hyperparameter comparison 
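Model | Convolutional filters | Metrics | Fix threshold | Otsu
U-Net baseline model (sigmoid and mean squared error) | (64, 128, 256, 512, 1024, 512, 256, 128, 64) | Pixel Accuracy [%] | 94.91 | 93.99
 | | Mean IoU [%] | [illegible] | [illegible]
 | | Mean Accuracy [%] | [illegible] | [illegible]
U-Net with sigmoid and binary cross-entropy | (64, 128, 256, 512, 1024, 512, 256, 128, 64) | Pixel Accuracy [%] | [illegible] | 99.48
 | | Mean IoU [%] | [illegible] | 94.92
 | | Mean Accuracy [%] | [illegible] | 96.73
U-Net with sigmoid and binary cross-entropy (down sampled) | (8, 16, 32, 64, 128, 64, 32, 16, 8) | Pixel Accuracy, Mean IoU, Mean Accuracy [%] | [illegible] | [illegible]
U-Net with sigmoid and binary cross-entropy (down sampled) | (16, 32, 64, 128, 256, 128, 64, 32, 16) | Pixel Accuracy, Mean IoU, Mean Accuracy [%] | [illegible] | [illegible]
U-Net with sigmoid and binary cross-entropy (down sampled) | (32, 64, 128, 256, 512, 256, 128, 64, 32) | Pixel Accuracy, Mean IoU, Mean Accuracy [%] | [illegible] | [illegible]
U-Net with sigmoid and binary cross-entropy (up sampled) | (128, 256, 512, 1024, 2048, 1024, 512, 256, 128) | Pixel Accuracy, Mean IoU, Mean Accuracy [%] | [illegible] | [illegible]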


Table 2 shows that the binary cross-entropy loss function increases the performance metrics. The most noticeable improvement is in the mean IoU, which increased by more than 20%, while the mean accuracy increased by about 15%. Moreover, the effect of the number of filters in the convolution layers can also be seen from the experimental results. The optimal numbers of convolutional filters were 64, 128, 256, 512, and 1024 in both the encoder and decoder paths, giving 99.48% pixel accuracy, 94.92% mean IoU, and 96.73% mean accuracy. In addition, the fix and Otsu thresholds as post-processing methods give only slightly different evaluation metrics. Figure 5 illustrates the segmentation result of the best model, U-Net with 64, 128, 256, 512, and 1024 convolutional filters and Otsu thresholding. Figure 6 shows the accuracy and loss curves of the training process. The loss curve decreases towards zero, and the accuracy curve increases towards 1.0. Furthermore, there is no gap between the training and validation curves, which indicates that the proposed architecture does not overfit.
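
The two loss functions compared above can be written out for per-pixel sigmoid outputs. The source does not explain why binary cross-entropy performed better; the sketch below only illustrates one well-known property, namely that cross-entropy penalizes a confident wrong pixel far more strongly than mean squared error, whose per-pixel penalty is bounded by 1. The function names, the clipping constant `eps`, and the toy predictions are assumptions.

```python
import math


def mse(y_true, y_prob):
    """Mean squared error between labels (0/1) and predicted probabilities."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


def bce(y_true, y_prob, eps=1e-7):
    """Binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)


labels = [1, 1, 0, 0]
good = [0.99, 0.99, 0.01, 0.01]    # all four pixels confidently correct
one_bad = [0.01, 0.99, 0.01, 0.01] # first pixel confidently wrong

for name, probs in (("good", good), ("one_bad", one_bad)):
    print(f"{name:8s} mse={mse(labels, probs):.4f} bce={bce(labels, probs):.4f}")
```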





Figure 5. Segmentation result of the best model: (a) ASD, (b) normal, (c) VSD


In Table 3, the proposed architecture is compared with other segmentation methods. We found only a limited number of DL-based segmentation methods for fetal cardiac studies. A number of previous studies calculate the segmentation performance using conventional methods without a learning process. Our proposed model produced 99.48% pixel accuracy, 94.92% mean IoU, and 96.73% mean accuracy, with an error rate of only about 0.21%. The best model gives satisfactory results compared to the others, even with very limited data.

Figure 6. Training curves of the best model: (a) accuracy, (b) loss


Table 3. The proposed model compared with other methods for fetal cardiac segmentation 


Researchers | Method | Number of data | Metric | Value
Fetal left ventricle [41] | Dynamic Convolutional Neural Networks | 8000 | Hausdorff Distance (HD) | 1.2648 mm
 | | | Mean Absolute Distance (MAD) | 0.2016 mm
 | | | Dice coefficient | 94.5%
Characterization of the fetal heart [42] | [illegible] Networks - VGG 16 stride | 2178 images | Error rate | 23.48%
Apical four-chamber view segmentation [43] | Cascaded U-nets (CU-net) | 1712 images | Pixel Accuracy | 92.9%
Cascaded CNN for fetal apical 4 chamber view segmentation [44] | DW-Net (cascaded convolutional neural network) | [illegible] images | Pixel Accuracy | [illegible]
Our Proposed Method | Convolutional Neural Networks - U-Net and Otsu threshold | 519 images | Pixel Accuracy | 99.48%
 | | | Mean IoU | 94.92%
 | | | Mean Accuracy | 96.73%
 | | | Error rate | 0.21%


4. CONCLUSION 

The diagnosis of CHDs in the fetus is a challenging task because of the small size of the fetal cardiac structure and the lack of available data. Ultrasonography videos in MP4 format from three maternity cases in the gestational age range of 18-23 weeks were used as the data set, with cardiologist validation. However, the raw data need to be pre-processed because of the unstructured dimensions and the low signal-to-noise ratio. The limited data and the small structure of the fetal heart are therefore impediments to a deep investigation. A deep learning approach is proposed to help experts diagnose CHDs. The CNN-based U-Net architecture can help novice and expert sonographers identify the fetal cardiac standard planes. The U-Net architecture was selected as the base architecture and combined with several post-processing methods. The network was trained and tested on the acquired data set. From the preliminary results, the segmentation performance based on PA, MA, and MIoU was observed. U-Net combined with the global thresholding approach (fix and Otsu methods) produces the best performance. On the other hand, local thresholding gives unsatisfactory results because it applies a different threshold to different regions. Based on performance metrics such as accuracy and error, our network produced results comparable with state-of-the-art techniques, with 99.48% PA, 96.73% MA, 94.92% MIoU, and 0.21% segmentation error. As part of future work, we plan to test the network performance on a larger data set obtained directly from several fetal subjects. Furthermore, we will try to detect and classify structural anomalies in the fetal heart: based on the retrieved slices, we will classify the volume as normal or abnormal for various types of CHDs and extend the work to extract standard planes associated with other anatomical structures.